TERM - 2

INTRO TO MACHINE LEARNING

Data Science from a different perspective

What is Data Science

Data science is an interdisciplinary field that uses scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

Data Science Process

Data science involves three components: organising, packaging and delivering data. Together they form the three-step OPD (Organise, Package, Deliver) data science process.

Step 1. Organise Data.

Organising data involves the physical storage and format of data, and incorporates best practices in data management.

Step 2. Package Data. 

Packaging data involves logically manipulating and joining the underlying raw data into a new representation and package.

Step 3. Deliver Data.

Delivering data involves ensuring that the message in the data reaches those who need to hear it.
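
The three steps can be sketched in a few lines of Python. This is only an illustration: the CSV contents, the column names and the helper functions load_rows, total_spend_by_customer and format_report are all made up for the example.

```python
import csv
import io

# Step 1. Organise: raw data is stored in a well-defined format (CSV text here).
# Both tables are hypothetical sample data.
CUSTOMERS_CSV = "customer_id,name\n1,Asha\n2,Ravi\n"
ORDERS_CSV = "order_id,customer_id,amount\n10,1,250\n11,1,100\n12,2,400\n"

def load_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Step 2. Package: join the raw tables into a new representation.
def total_spend_by_customer(customers, orders):
    """Join orders to customers on customer_id and total the amounts."""
    totals = {c["customer_id"]: 0 for c in customers}
    for order in orders:
        totals[order["customer_id"]] += int(order["amount"])
    return {c["name"]: totals[c["customer_id"]] for c in customers}

# Step 3. Deliver: present the message to the people who need it.
def format_report(spend):
    """Render the packaged data as a short, readable report."""
    return "\n".join(f"{name}: {amount} rupees"
                     for name, amount in sorted(spend.items()))

spend = total_spend_by_customer(load_rows(CUSTOMERS_CSV), load_rows(ORDERS_CSV))
report = format_report(spend)
print(report)
```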


Intro to Machine Learning

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem.

Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.
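
A small sketch of this idea: the same two functions below are applied to two unrelated problems, and only the data changes. The algorithm used (a nearest-centroid classifier), the function names train and predict, and both datasets are assumptions chosen for illustration.

```python
from statistics import mean

def train(examples):
    """Generic learner: average the feature values seen for each label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    # One centroid (mean point) per label; this is the "logic" built from data.
    return {label: [mean(column) for column in zip(*rows)]
            for label, rows in by_label.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Problem 1 (hypothetical data): fruit described by (weight in g, diameter in cm).
fruit_model = train([((150, 7), "apple"), ((170, 8), "apple"),
                     ((10, 2), "grape"), ((12, 2), "grape")])

# Problem 2 (hypothetical data): email described by (links, exclamation marks).
mail_model = train([((9, 12), "spam"), ((7, 8), "spam"),
                    ((0, 1), "not spam"), ((1, 0), "not spam")])

print(predict(fruit_model, (160, 7)))  # apple
print(predict(mail_model, (8, 10)))    # spam
```

No line of train or predict mentions fruit or email; everything problem-specific comes from the examples passed in.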

Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:

  • Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
  • Whether or not they can learn incrementally on the fly (online versus batch learning)
  • Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Supervised Learning

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output: Y = f(X). The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.

Supervised learning problems can be further grouped into regression and classification problems.

Classification:

A classification problem is when the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.

Regression:

A regression problem is when the output variable is a real value, such as a price in rupees or a weight.

Some popular examples of supervised Machine Learning algorithms are:

  • Linear regression for regression problems,
  • Random forest for classification and regression problems,
  • Support vector machines (SVM) for classification problems.
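
The difference between the two problem types shows up in the output the model produces. The sketch below is not any of the algorithms listed above; it uses a decision stump (a decision tree with a single split, the kind of tree a random forest combines many of) because the same code can be pointed at a category or at a real value. The function names and both datasets are hypothetical, and the code assumes at least two distinct feature values.

```python
from collections import Counter
from statistics import mean

def majority(labels):
    """Most common label in a list (used for classification leaves)."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(xs, ys, summarise, loss):
    """Try every midpoint between neighbouring feature values and keep the
    split with the lowest total loss. Returns (threshold, left, right)."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_value, right_value = summarise(left), summarise(right)
        total = (sum(loss(y, left_value) for y in left)
                 + sum(loss(y, right_value) for y in right))
        if best is None or total < best[0]:
            best = (total, threshold, left_value, right_value)
    return best[1:]

def predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

# Classification: the output is a category.
blood_sugar = [85, 90, 95, 150, 160, 170]
diagnosis = ["no disease"] * 3 + ["disease"] * 3
classifier = fit_stump(blood_sugar, diagnosis, majority,
                       loss=lambda y, p: y != p)          # count mistakes

# Regression: the output is a real value (price in thousands of rupees).
car_age_years = [1, 2, 3, 8, 9, 10]
price = [500, 480, 460, 200, 180, 160]
regressor = fit_stump(car_age_years, price, mean,
                      loss=lambda y, p: (y - p) ** 2)     # squared error

print(predict(classifier, 155))  # disease
print(predict(regressor, 2))     # 480
```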

In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.

Unsupervised Machine Learning

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.

Unsupervised learning problems can be further grouped into clustering and association problems.

Clustering:

A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
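
As a rough sketch of clustering, the code below groups customers by a single number (monthly spend) using k-means. The choice of k-means, the function name k_means_1d, the way the starting centroids are picked and the spend figures are all assumptions for the example; note that no labels are supplied.

```python
def k_means_1d(values, k, iterations=10):
    """Group one-dimensional values into k clusters.
    Returns (centroids, labels), where labels[i] is the cluster of values[i]."""
    # Simple deterministic start: evenly spaced picks from the sorted values.
    ordered = sorted(values)
    step = max(1, len(ordered) // k)
    centroids = ordered[::step][:k]

    def nearest(v):
        return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))

    for _ in range(iterations):
        # Assignment step: each value joins its closest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            clusters[nearest(v)].append(v)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]

    return centroids, [nearest(v) for v in values]

monthly_spend = [12, 15, 14, 90, 95, 100]   # hypothetical customers
centroids, labels = k_means_1d(monthly_spend, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm discovers a low-spend group and a high-spend group on its own; deciding what those groups mean is left to the analyst.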

Association:

An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
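
A rule such as "people that buy X also tend to buy Y" is usually scored with two numbers: support (how often the items appear together) and confidence (how often Y appears when X does). The sketch below computes both; the function names and the shopping baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))       # 0.5
print(confidence(baskets, {"bread"}, {"butter"}))  # 2 of the 3 bread baskets
```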

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data.

Some photo-hosting services, such as Google Photos, are good examples of this.
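
One way to picture this, offered as an assumption rather than a description of how any particular service works: an unsupervised step first groups similar photos, then the few labels supplied by a user are spread to every photo in the same group. The function propagate_labels, the photo ids, the cluster ids and the names are all hypothetical.

```python
from collections import Counter

def propagate_labels(cluster_of, known_labels):
    """Give every item the most common known label within its cluster.

    cluster_of:   item -> cluster id (output of an unsupervised step)
    known_labels: item -> label, for the few items a person has labelled
    Items in a cluster with no labelled member get None.
    """
    votes = {}
    for item, label in known_labels.items():
        votes.setdefault(cluster_of[item], Counter())[label] += 1
    cluster_label = {cluster: counts.most_common(1)[0][0]
                     for cluster, counts in votes.items()}
    return {item: cluster_label.get(cluster)
            for item, cluster in cluster_of.items()}

# Six photos grouped into three clusters; only two photos are labelled.
cluster_of = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 2}
known_labels = {"p1": "Asha", "p4": "Ravi"}

labels = propagate_labels(cluster_of, known_labels)
print(labels["p3"], labels["p5"], labels["p6"])  # Asha Ravi None
```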

Why Use Machine Learning?

To summarize, Machine Learning is great for:

  • Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
  • Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
  • Fluctuating environments: a Machine Learning system can adapt to new data.
  • Getting insights about complex problems and large amounts of data.

Reinforcement Learning

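Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards). It must then learn by itself the best strategy, called a policy, to get the most reward over time.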
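
A minimal sketch of the act-and-get-rewarded loop, assuming a very simple environment with two possible actions and a fixed reward for each. The agent uses an epsilon-greedy rule (mostly pick the action that looks best so far, occasionally try a random one); that rule, the function name run_agent and the reward values are choices made for this example.

```python
import random

def run_agent(rewards, steps=500, epsilon=0.1, seed=0):
    """Let an agent repeatedly pick an action and learn from the reward.

    rewards[a] is the (fixed, hypothetical) reward the environment gives
    for action a. Returns the agent's reward estimates and total reward.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)   # what the agent believes so far
    counts = [0] * len(rewards)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                 # explore
        else:
            action = max(range(len(rewards)),
                         key=lambda a: estimates[a])             # exploit
        reward = rewards[action]          # the environment responds
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, total

estimates, total = run_agent(rewards=[0.2, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action)  # 1, the action with the higher reward
```

Nobody tells the agent which action is correct; it finds out by trying both and comparing the rewards it receives.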

Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

Batch learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data.

Online learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives.
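
The contrast between the two modes can be sketched with a one-parameter model y = w * x. The batch version needs the whole dataset in one call; the online version updates w a little after every instance, using a gradient step. The names batch_fit, OnlineModel and learn_one, the learning rate and the data stream are all assumptions for this example.

```python
def batch_fit(xs, ys):
    """Batch learning: least-squares estimate of w in y = w * x,
    computed from all the available data at once."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

class OnlineModel:
    """Online learning: the same model, updated one instance at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.learning_rate = learning_rate

    def learn_one(self, x, y):
        # One cheap step: nudge w to reduce the error on this instance.
        error = y - self.w * x
        self.w += self.learning_rate * error * x

    def predict(self, x):
        return self.w * x

# Hypothetical stream of (x, y) instances that follow y = 2x.
stream = [(1, 2), (2, 4), (3, 6)] * 10

model = OnlineModel()
for x, y in stream:          # instances arrive one by one
    model.learn_one(x, y)    # the model is usable after every step

xs, ys = zip(*stream)
print(batch_fit(xs, ys))     # 2.0
print(round(model.w, 2))     # close to 2
```

To take new data into account, the batch version must be rerun on the full dataset, while the online model only needs another call to learn_one.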

Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize.

Instance-based learning

The system learns the examples by heart, then generalizes to new cases using a similarity measure.
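
A nearest-neighbour classifier is a simple example of this approach. In the sketch below, "training" only stores the examples, and the similarity measure is assumed to be Euclidean distance (smaller distance means more similar). The class name and the animal measurements are hypothetical.

```python
import math

class NearestNeighbour:
    """Instance-based learner: keeps every training example."""

    def __init__(self):
        self.examples = []

    def fit(self, examples):
        # "Learning by heart": nothing is computed, the data is just stored.
        self.examples = list(examples)
        return self

    def predict(self, features):
        # Generalise by copying the label of the most similar stored example.
        _, label = min(self.examples,
                       key=lambda example: math.dist(example[0], features))
        return label

# Hypothetical animals described by (height in cm, weight in kg).
model = NearestNeighbour().fit([
    ((25, 4), "cat"), ((23, 5), "cat"),
    ((60, 30), "dog"), ((55, 25), "dog"),
])

print(model.predict((24, 4.5)))  # cat
print(model.predict((58, 28)))   # dog
```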

Model-based learning

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning.
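
As a sketch, assume the model is a straight line, price = intercept + slope * mileage, fitted by ordinary least squares. Once the two parameters are known, predictions use only the model and the training examples are no longer needed. The function names and the car data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical used-car data: mileage in km and price in rupees.
mileage = [10_000, 20_000, 30_000, 40_000]
price = [500_000, 450_000, 400_000, 350_000]

# Build the model: two numbers summarise all the examples.
slope, intercept = fit_line(mileage, price)

def predict_price(km):
    """Use the model (not the stored examples) to make a prediction."""
    return intercept + slope * km

print(slope, intercept)        # -5.0 550000.0
print(predict_price(25_000))   # 425000.0
```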

The Machine Learning Stack

Technologies

R

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. [4]

RStudio

RStudio is open-source, enterprise-ready professional software for R. [5]

Python

Python is a general-purpose, interpreted, dynamically typed, interactive, object-oriented, high-level programming language. [6] There are two major versions, Python 2 and Python 3; Python 2 reached end of life in January 2020, so new work should use Python 3.

Anaconda

Anaconda is a freemium open source distribution of the Python and R programming languages for large-scale data processing, predictive analytics, and scientific computing. [7] It aims to simplify package management and deployment.

Its package management system is conda.

PyCharm

PyCharm is an Integrated Development Environment (IDE) used in computer programming, specifically for the Python language. [8]

PyCharm is cross-platform, with Windows, macOS and Linux versions.

Atom

Atom is a free and open-source text and source code editor developed by GitHub. [9] It is available for macOS, Linux, and Microsoft Windows, supports plug-ins written in Node.js, and has embedded Git control.

Shell Script

The shell is a program that takes your commands from the keyboard and gives them to the operating system to perform.

It is also an environment in which we can run our commands, programs, and shell scripts.

Command Line

The command line is a means of interacting with a computer program in which the user communicates with the system through text-based commands.

PuTTY

PuTTY is a client program for the SSH, Telnet and Rlogin network protocols. These protocols are all used to run a remote session on a computer, over a network. [10]

Dask

Dask is a flexible parallel computing library for analytic computing. [11]

TensorFlow

TensorFlow™ is an open source software library for numerical computation using data flow graphs. [12]

PyTorch

PyTorch is a Python package that provides two high-level features: [13]

  1. Tensor computation (like NumPy) with strong GPU acceleration
  2. Deep Neural Networks built on a tape-based autograd system
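
To give a feel for what "tape-based autograd" means, here is a toy sketch in plain Python. It is not PyTorch code and does not use the PyTorch API: the Var class and the backward function are invented for illustration. Each operation is recorded on a tape as it runs, and gradients are then obtained by replaying the tape backwards.

```python
class Var:
    """A scalar value that records the operations applied to it on a tape."""

    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def __add__(self, other):
        out = Var(self.value + other.value, self.tape)
        # Local derivatives: d(out)/d(self) = 1 and d(out)/d(other) = 1.
        self.tape.append((out, [(self, 1.0), (other, 1.0)]))
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.tape)
        # Local derivatives: d(out)/d(self) = other, d(out)/d(other) = self.
        self.tape.append((out, [(self, other.value), (other, self.value)]))
        return out

def backward(output):
    """Replay the tape in reverse, applying the chain rule at each step."""
    output.grad = 1.0
    for out, parents in reversed(output.tape):
        for parent, local_grad in parents:
            parent.grad += local_grad * out.grad

tape = []
x = Var(3.0, tape)
w = Var(2.0, tape)
b = Var(1.0, tape)

y = w * x + b          # forward pass: y = 2 * 3 + 1 = 7
backward(y)            # backward pass fills in the gradients

print(y.value)                  # 7.0
print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0
```

The gradients match the calculus: for y = w * x + b, dy/dw = x, dy/dx = w and dy/db = 1.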

Apache Spark

Apache Spark is an open-source cluster-computing framework. [14]

H2O

H2O is a fast statistical, machine learning and math runtime for big data. [15]

DL4J

DL4J is an open-source, distributed deep learning library for the JVM. [16]

IBM Watson

Watson is a question answering computer system capable of answering questions posed in natural language, developed by IBM. [17]

Cassandra

Apache Cassandra is a free and open-source distributed NoSQL database management system. [18] It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Use Cases

References

[1] Frank Chamaki blog
[2] Victor Lavrenko
[3] Venturisity Blog